(Artificial) Neural Networks (ANN)


By Prof. Seungchul Lee
http://iai.postech.ac.kr/
Industrial AI Lab at POSTECH

Table of Contents

1. Recall Supervised Learning Setup¶

Perceptron


XOR Problem

  • Minsky-Papert Controversy on XOR
    • not linearly separable
    • limitation of perceptron
$x_1$   $x_2$   $x_1$ XOR $x_2$
  0       0            0
  0       1            1
  1       0            1
  1       1            0
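A quick numerical illustration of this point: no perceptron weights $(\omega_0, \omega_1, \omega_2)$ reproduce the XOR table. The sketch below scans a coarse weight grid (an illustration, not a proof; the grid range is an assumption):

```python
import itertools
import numpy as np

# the four XOR input/output pairs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

def perceptron(w, X):
    # step activation: y_hat = 1 when w0 + w1*x1 + w2*x2 > 0
    return (w[0] + X @ w[1:] > 0).astype(int)

grid = np.linspace(-2, 2, 21)  # coarse grid over each weight (illustrative)
solved = any(np.array_equal(perceptron(np.array(w), X), y)
             for w in itertools.product(grid, repeat=3))
print(solved)  # → False: no single line separates the two XOR classes
```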



2. From Perceptron to Multi-Layer Perceptron (MLP)¶

2.1. Perceptron for $h_{\omega}(x)$¶

  • Neurons compute the weighted sum of their inputs

  • A neuron is activated or fired when the sum $a$ is positive


$$ \begin{align*} a &= \omega_0 + \omega_1 x_1 + \omega_2 x_2 \\ \\ \hat{y} &= g(a) = \begin{cases} 1 & a > 0\\ 0 & \text{otherwise} \end{cases} \end{align*} $$



  • A step function is not differentiable


  • A single layer is often not enough
    • It provides only one separating hyperplane
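The two bullets above translate directly into code; a minimal sketch of the perceptron $h_{\omega}(x)$ with a step activation (the weights below are illustrative and happen to implement logical AND):

```python
import numpy as np

def perceptron(x, w):
    # weighted sum a = w0 + w1*x1 + w2*x2 (w[0] plays the role of omega_0)
    a = w[0] + np.dot(w[1:], x)
    # step activation g(a): the neuron fires (1) when a > 0
    return 1 if a > 0 else 0

w = np.array([-1.5, 1.0, 1.0])  # illustrative weights: fires only for x1 = x2 = 1
print([perceptron(x, w) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 0, 0, 1]
```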

2.2. Multi-layer Perceptron = Artificial Neural Networks (ANN)¶

Multi-neurons



Differentiable activation function




In a compact representation




Multi-layer perceptron


2.3. Another Perspective: ANN as Kernel Learning¶

We can represent this "neuron" as follows:

  • The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable.

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.
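One such pre-processing (chosen here for illustration) appends the product feature $x_1 x_2$; in the lifted space $(x_1, x_2, x_1 x_2)$ a single linear unit suffices, since $x_1 + x_2 - 2x_1x_2$ equals XOR exactly:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# lift each point to (x1, x2, x1*x2)
Phi = np.hstack([X, X[:, :1] * X[:, 1:]])

# hand-picked weights: x1 + x2 - 2*x1*x2 reproduces XOR
w = np.array([1.0, 1.0, -2.0])
y_hat = (Phi @ w > 0.5).astype(int)
print(y_hat)  # → [0 1 1 0]
```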


Universal function approximator
Universal function classifier

Parameterized


Example: Linear Classifier

  • Perceptron tries to separate the two classes of data by dividing them with a line


Example: Neural Networks

  • The hidden layer learns a representation so that the data is linearly separable


colah's blog

3. Logistic Regression¶

3.1. Logistic Regression with TensorFlow¶

$$y^{(i)} \in \{1,0\}$$
In [1]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import time
%matplotlib inline
In [2]:
# training data generation
m = 1000
x1 = 8*np.random.rand(m, 1)
x2 = 7*np.random.rand(m, 1) - 4

g = 0.8*x1 + x2 - 3

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([np.ones([N,1]), x1[C1], x2[C1]])
X0 = np.hstack([np.ones([M,1]), x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_y = np.vstack([np.ones([N,1]), -np.ones([M,1])])

train_X = np.asmatrix(train_X)
train_y = np.asmatrix(train_y)

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.show()
In [3]:
train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
train_y = np.asmatrix(train_y)
In [4]:
import tensorflow as tf

LR = 0.05
n_iter = 15000

x = tf.placeholder(tf.float32, [None, 3])
y = tf.placeholder(tf.float32, [None, 1])

w = tf.Variable(tf.random_normal([3,1]))

y_pred = tf.matmul(x,w)
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = y_pred, labels = y)
loss = tf.reduce_mean(loss)

optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(n_iter):                                                                         
        sess.run(optm, feed_dict = {x: train_X, y: train_y})          
    
    w_hat = sess.run(w)
In [5]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w_hat[1,0]/w_hat[2,0]*x1p - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

3.2. Logistic Regression in a Form of Neural Network¶



$$y = \sigma \,(\omega_0 + \omega_1 x_1 + \omega_2 x_2)$$




In [6]:
# define input and output size

n_input = 3
n_output = 1
In [7]:
# define weights as a dictionary 

weights = {
    'output' : tf.Variable(tf.random_normal([n_input, n_output], stddev = 0.1))
}
In [8]:
# define placeholders for train_x and train_y

x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [9]:
# define network architecture

def build_model(x, weights):   
    output = tf.matmul(x, weights['output'])    
    return output
In [10]:
# define loss

pred = build_model(x, weights)
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits = pred, labels = y)
loss = tf.reduce_mean(loss)
In [11]:
LR = 0.05
optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)
In [12]:
n_batch = 50     # Batch size
n_iter = 15000   # Learning iteration
n_prt = 250      # Print cycle

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

# training or learning

loss_record = []
for epoch in range(n_iter):
    sess.run(optm, feed_dict = {x: train_X,  y: train_y})    
    if epoch % n_prt == 0:
        loss_record.append(sess.run(loss, feed_dict = {x: train_X,  y: train_y}))
        
w_hat = sess.run(weights['output'])
In [13]:
plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
In [14]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w_hat[1,0]/w_hat[2,0]*x1p - w_hat[0,0]/w_hat[2,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

Weights and Bias

  • In a neural network, weights and biases are typically separated.



$$ \begin{align*} y_j &= \left(\sum\limits_i \omega_{ij}x_i\right) + b_j\\ y &= \omega^T x + b \end{align*} $$



$$y = \sigma \,(b + \omega_1 x_1 + \omega_2 x_2)$$





In [15]:
n_input = 2
n_output = 1
In [16]:
train_X = train_X[:,1:3]
In [17]:
# define network

def build_model(x, weights, biases):   
    output = tf.add(tf.matmul(x, weights['output']), biases['output'])
    return output
In [18]:
weights = {
    'output' : tf.Variable(tf.random_normal([n_input, n_output], stddev = 0.1))
}

biases = {
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1))
}

x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])

pred = build_model(x, weights, biases)
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=pred, labels=y)
loss = tf.reduce_mean(loss)

LR = 0.05
optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

n_batch = 50     
n_iter = 15000   
n_prt = 250      

loss_record = []
for epoch in range(n_iter):
    sess.run(optm, feed_dict = {x: train_X,  y: train_y})     
    if epoch % n_prt == 0:
        loss_record.append(sess.run(loss, feed_dict = {x: train_X,  y: train_y}))
        
w_hat = sess.run(weights['output'])
b_hat = sess.run(biases['output'])

plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
In [19]:
x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w_hat[0,0]/w_hat[1,0]*x1p - b_hat[0]/w_hat[1,0]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()

One-hot Encoding

  • One-hot encoding is conventional practice for multi-class classification


$$y^{(i)} \in \{1,0\} \quad \implies \quad y^{(i)} \in \{[0,1],[1,0]\}$$

  • tf.nn.sigmoid_cross_entropy_with_logits $\rightarrow$ tf.nn.softmax_cross_entropy_with_logits
In [20]:
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
train_y = ohe.fit_transform(train_y).toarray()
print(train_y)
[[0. 1.]
 [0. 1.]
 [0. 1.]
 ...
 [1. 0.]
 [1. 0.]
 [1. 0.]]




In [21]:
n_input = 2
n_output = 2
In [22]:
weights = {
    'output' : tf.Variable(tf.random_normal([n_input, n_output], stddev = 0.1))
}

biases = {
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1))
}

x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])

pred = build_model(x, weights, biases)
loss = tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y)
loss = tf.reduce_mean(loss)

LR = 0.05
optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)

sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

n_batch = 50     
n_iter = 15000   
n_prt = 250      

loss_record = []
for epoch in range(n_iter):
    sess.run(optm, feed_dict = {x: train_X,  y: train_y})     
    if epoch % n_prt == 0:
        loss_record.append(sess.run(loss, feed_dict = {x: train_X,  y: train_y}))
        
w_hat = sess.run(weights['output'])
b_hat = sess.run(biases['output'])

plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()

In [23]:
print(w_hat)

x1p = np.arange(0, 8, 0.01).reshape(-1, 1)
x2p = - w_hat[0,0]/w_hat[1,0]*x1p - b_hat[0]/w_hat[1,0]
x3p = - w_hat[0,1]/w_hat[1,1]*x1p - b_hat[1]/w_hat[1,1]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlim([0, 8])
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 12)
plt.show()
[[-1.742706   1.7511731]
 [-2.2152994  2.2633781]]

4. Looking at Parameters¶

  • To understand the network's behavior

4.1. Multi-Layers¶

In [24]:
# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
ohe = OneHotEncoder(handle_unknown='ignore')
train_y = ohe.fit_transform(train_y).toarray()

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()




In [25]:
n_input = 2
n_hidden = 2
n_output = 2
In [26]:
weights = {
    'hidden' : tf.Variable(tf.random_normal([n_input, n_hidden], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_hidden, n_output], stddev = 0.1))
}

biases = {
    'hidden' : tf.Variable(tf.random_normal([n_hidden], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1))
}

x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])



In [27]:
def build_model(x, weights, biases):
    hidden = tf.add(tf.matmul(x, weights['hidden']), biases['hidden'])
    hidden = tf.nn.sigmoid(hidden)
    
    output = tf.add(tf.matmul(hidden, weights['output']), biases['output'])  
    return output
In [28]:
pred = build_model(x, weights, biases)
loss = tf.nn.softmax_cross_entropy_with_logits(logits = pred, labels = y)
loss = tf.reduce_mean(loss)

LR = 0.01
optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)

sess = tf.Session()

init = tf.global_variables_initializer()
sess.run(init)

n_batch = 50     
n_iter = 50000   
n_prt = 250      

loss_record = []
for epoch in range(n_iter):
    sess.run(optm, feed_dict = {x: train_X,  y: train_y})     
    if epoch % n_prt == 0:
        loss_record.append(sess.run(loss, feed_dict = {x: train_X,  y: train_y}))
        
w_hat = sess.run(weights)
b_hat = sess.run(biases)

plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
In [29]:
H = train_X*w_hat['hidden'] + b_hat['hidden']
H = 1/(1 + np.exp(-H))
In [30]:
plt.figure(figsize=(10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
In [31]:
x1p = np.arange(0, 1, 0.01).reshape(-1, 1)
x2p = - w_hat['output'][0,0]/w_hat['output'][1,0]*x1p - b_hat['output'][0]/w_hat['output'][1,0]
x3p = - w_hat['output'][0,1]/w_hat['output'][1,1]*x1p - b_hat['output'][1]/w_hat['output'][1,1]

plt.figure(figsize=(10, 8))
plt.plot(H[0:N,0], H[0:N,1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(H[N:m,0], H[N:m,1], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$z_1$', fontsize = 15)
plt.ylabel('$z_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
In [32]:
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w_hat['hidden'][0,0]/w_hat['hidden'][1,0]*x1p - b_hat['hidden'][0]/w_hat['hidden'][1,0]
x3p = - w_hat['hidden'][0,1]/w_hat['hidden'][1,1]*x1p - b_hat['hidden'][1]/w_hat['hidden'][1,1]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

4.2. Multi-Neurons¶

In [33]:
# training data generation

m = 1000
x1 = 10*np.random.rand(m, 1) - 5
x2 = 8*np.random.rand(m, 1) - 4

g = - 0.5*(x1*x2-1)**2 + 2*x2 + 5

C1 = np.where(g >= 0)[0]
C0 = np.where(g < 0)[0]
N = C1.shape[0]
M = C0.shape[0]
m = N + M

X1 = np.hstack([x1[C1], x2[C1]])
X0 = np.hstack([x1[C0], x2[C0]])

train_X = np.vstack([X1, X0])
train_X = np.asmatrix(train_X)

train_y = np.vstack([np.ones([N,1]), np.zeros([M,1])])
ohe = OneHotEncoder(handle_unknown='ignore')
train_y = ohe.fit_transform(train_y).toarray()

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.legend(loc = 1, fontsize = 15)
plt.xlabel(r'$x_1$', fontsize = 15)
plt.ylabel(r'$x_2$', fontsize = 15)
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()




In [34]:
n_input = 2
n_hidden = 4
n_output = 2
In [35]:
def build_model(x, weights, biases):    
    hidden = tf.add(tf.matmul(x, weights['hidden']), biases['hidden'])    
    hidden = tf.nn.sigmoid(hidden)
    
    output = tf.add(tf.matmul(hidden, weights['output']), biases['output'])    
    return output
In [36]:
weights = {
    'hidden' : tf.Variable(tf.random_normal([n_input, n_hidden], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_hidden, n_output], stddev = 0.1))
}

biases = {
    'hidden' : tf.Variable(tf.random_normal([n_hidden], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1))
}

x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])
In [37]:
pred = build_model(x, weights, biases)
loss = tf.nn.softmax_cross_entropy_with_logits(logits = pred, labels = y)
loss = tf.reduce_mean(loss)

LR = 0.01
optm = tf.train.GradientDescentOptimizer(LR).minimize(loss)

sess = tf.Session()

init = tf.global_variables_initializer()
sess.run(init)

n_batch = 50     
n_iter = 80000   
n_prt = 250      

# Training cycle
loss_record = []
for epoch in range(n_iter):
    sess.run(optm, feed_dict = {x: train_X,  y: train_y})     
    if epoch % n_prt == 0:
        loss_record.append(sess.run(loss, feed_dict = {x: train_X,  y: train_y}))
        
w_hat = sess.run(weights)
b_hat = sess.run(biases)

# plots
plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record))*n_prt, loss_record)
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.show()
In [38]:
x1p = np.arange(-5, 5, 0.01).reshape(-1, 1)
x2p = - w_hat['hidden'][0,0]/w_hat['hidden'][1,0]*x1p - b_hat['hidden'][0]/w_hat['hidden'][1,0]
x3p = - w_hat['hidden'][0,1]/w_hat['hidden'][1,1]*x1p - b_hat['hidden'][1]/w_hat['hidden'][1,1]
x4p = - w_hat['hidden'][0,2]/w_hat['hidden'][1,2]*x1p - b_hat['hidden'][2]/w_hat['hidden'][1,2]
x5p = - w_hat['hidden'][0,3]/w_hat['hidden'][1,3]*x1p - b_hat['hidden'][3]/w_hat['hidden'][1,3]

plt.figure(figsize=(10, 8))
plt.plot(x1[C1], x2[C1], 'ro', alpha = 0.4, label = 'C1')
plt.plot(x1[C0], x2[C0], 'bo', alpha = 0.4, label = 'C0')
plt.plot(x1p, x2p, 'k', linewidth = 3, label = '')
plt.plot(x1p, x3p, 'g', linewidth = 3, label = '')
plt.plot(x1p, x4p, 'm', linewidth = 3, label = '')
plt.plot(x1p, x5p, 'c', linewidth = 3, label = '')
plt.xlabel('$x_1$', fontsize = 15)
plt.ylabel('$x_2$', fontsize = 15)
plt.legend(loc = 1, fontsize = 15)
plt.axis('equal')
plt.xlim([-5, 5])
plt.ylim([-4, 4])
plt.show()

5. Artificial Neural Networks¶

  • Complex/Nonlinear universal function approximator
    • Linearly connected networks
    • Simple nonlinear neurons
  • Hidden layers
    • Autonomous feature learning



5.1. Recursive Algorithm¶

  • One of the central ideas of computer science

  • Depends on solutions to smaller instances of the same problem (= subproblems)

  • A function calls itself (a kind of self-reference that is impossible for physical objects)



  • Factorial example


$$n ! = n \cdot (n-1) \cdots 2 \cdot 1$$

In [39]:
n = 5

m = 1
for i in range(n):
    m = m*(i+1)
    
print(m)
120
In [40]:
def fac(n):
    if n == 1:
        return 1
    else:
        return n*fac(n-1)    
In [41]:
# recursive

fac(5)
Out[41]:
120

5.2. Dynamic Programming¶

  • Dynamic programming: a general, powerful algorithm-design technique

  • Fibonacci numbers:

In [42]:
# naive Fibonacci

def fib(n):
    if n <= 2:
        return 1
    else:
        return fib(n-1) + fib(n-2)    
In [43]:
fib(10)
Out[43]:
55
In [44]:
# memoized DP Fibonacci

def mfib(n):
    global memo
        
    if memo[n-1] != 0:
        return memo[n-1]
    elif n <= 2:
        memo[n-1] = 1
        return memo[n-1]
    else:
        memo[n-1] = mfib(n-1) + mfib(n-2)
        return memo[n-1]
In [45]:
import numpy as np

n = 10
memo = np.zeros(n)
mfib(n)
Out[45]:
55.0
In [46]:
n = 30
%timeit fib(30)
166 ms ± 508 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [47]:
memo = np.zeros(n)
%timeit mfib(30)
410 ns ± 1.39 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

5.3. Training Neural Networks¶

$=$ Learning or estimating the weights and biases of a multi-layer perceptron from training data

5.3.1. Optimization¶

3 key components

  1. objective function $f(\cdot)$
  2. decision variable or unknown $\omega$
  3. constraints $g(\cdot)$

In mathematical expression



$$\begin{align*} \min_{\omega} \quad &f(\omega)\\ \text{subject to} \quad &g(\omega) \leq 0 \end{align*} $$

5.3.2. Loss Function¶

  • Measures error between target values and predictions


$$ \min_{\omega} \sum_{i=1}^{m}\ell\left( h_{\omega}\left(x^{(i)}\right),y^{(i)}\right)$$

  • Example
    • Squared loss (for regression): $$ \frac{1}{m} \sum_{i=1}^{m} \left(h_{\omega}\left(x^{(i)}\right) - y^{(i)}\right)^2 $$
    • Cross entropy (for classification): $$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\omega}\left(x^{(i)}\right)\right) + \left(1-y^{(i)}\right)\log\left(1-h_{\omega}\left(x^{(i)}\right)\right)\right]$$
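Both losses are one-liners to evaluate; a small numpy sketch with made-up targets and predictions:

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])  # illustrative targets
h = np.array([0.9, 0.2, 0.7, 0.6])  # illustrative predictions h_w(x) in (0, 1)

# squared loss (regression)
squared = np.mean((h - y) ** 2)

# cross entropy (binary classification)
cross_entropy = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

print(squared, cross_entropy)
```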

5.3.3. Learning¶

Learning weights and biases from data using gradient descent


$$\omega \Leftarrow \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega}\left(x^{(i)}\right), y^{(i)} \right)$$
  • $\frac{\partial \ell}{\partial \omega}$: computed naively for every $\omega$ separately, this requires too many computations
  • Structural constraints of NN:
    • Composition of functions
    • Chain rule
    • Dynamic programming
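The update rule above on a toy problem: fitting a one-parameter model $h_{\omega}(x) = \omega x$ to data generated from $y = 2x$ by repeated gradient steps (data and learning rate are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = 2.0 * x                # generated from y = 2x, so the optimum is w = 2

w, alpha = 0.0, 0.05       # initial weight, learning rate
for _ in range(200):
    grad = np.mean(2 * (w * x - y) * x)  # d/dw of the mean squared loss
    w = w - alpha * grad                 # w <- w - alpha * gradient
print(round(w, 4))  # → 2.0
```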


Backpropagation

  • Forward propagation
    • the initial information propagates up to the hidden units at each layer and finally produces output
  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients
  • Chain Rule

    • Computing the derivative of the composition of functions

      • $\space f(g(x))' = f'(g(x))\,g'(x)$

      • $\space {dz \over dx} = {dz \over dy} \cdot {dy \over dx}$

      • $\space {dz \over dw} = \left({dz \over dy} \cdot {dy \over dx}\right) \cdot {dx \over dw}$

      • $\space {dz \over du} = \left({dz \over dy} \cdot {dy \over dx} \cdot {dx \over dw}\right) \cdot {dw \over du}$

  • Backpropagation

    • Update weights recursively with memory
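The chain rule is easy to check numerically; a sketch comparing the analytic derivative of a composition $f(g(x))$ with a central finite difference (the functions are chosen only for illustration):

```python
import numpy as np

g = np.sin                  # inner function
f = lambda u: u ** 2        # outer function, so f(g(x)) = sin(x)^2

def chain_derivative(x):
    # f'(g(x)) * g'(x) = 2*sin(x) * cos(x)
    return 2 * np.sin(x) * np.cos(x)

x0, eps = 0.7, 1e-6
numeric = (f(g(x0 + eps)) - f(g(x0 - eps))) / (2 * eps)
print(abs(chain_derivative(x0) - numeric) < 1e-6)  # → True
```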

Optimization procedure


  • It is generally not easy to compute gradients in a network by hand.
    • The good news: people have already done all the "hard work" of developing numerical solvers (or libraries)
    • There is a wide range of tools; here we use TensorFlow

Summary

  • Learning weights and biases from data using gradient descent


6. ANN with MNIST¶

6.1. What's an MNIST?¶

From Wikipedia

  • The MNIST database (Mixed National Institute of Standards and Technology database) is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, NIST's complete dataset was too hard.
  • MNIST (Mixed National Institute of Standards and Technology database) database
    • Handwritten digit database
    • $28 \times 28$ gray scaled image
    • Flattened matrix into a vector of $28 \times 28 = 784$
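The flattening in the last bullet is a plain reshape; a small sketch with a random array standing in for a gray-scale image:

```python
import numpy as np

img = np.random.rand(28, 28)  # dummy image: 28 x 28 intensities in [0, 1)
vec = img.reshape(-1)         # flatten row by row into a length-784 vector
print(vec.shape)              # → (784,)

# the inverse reshape recovers the original image
restored = vec.reshape(28, 28)
print(np.array_equal(img, restored))  # → True
```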



More here

We will be using MNIST to create a multinomial classifier that can detect whether a given MNIST image belongs to class 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Succinctly, we're teaching a computer to recognize handwritten digits.

In [48]:
# import os
# os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"]="0"
In [49]:
# Import Library
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

Let's download and load the dataset.

In [50]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
In [51]:
print ("The training data set is:\n")
print (mnist.train.images.shape)
print (mnist.train.labels.shape)
The training data set is:

(55000, 784)
(55000, 10)
In [52]:
print ("The test data set is:")
print (mnist.test.images.shape)
print (mnist.test.labels.shape)
The test data set is:
(10000, 784)
(10000, 10)

Display a sample from it:

In [53]:
mnist.train.images[5]
Out[53]:
array([0.        , 0.        , 0.        , ..., 0.9921569 , ...])
(length-784 vector of gray-scale pixel intensities in $[0, 1]$)
       0.8705883 , 0.14117648, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.03137255, 0.7176471 ,
       0.9921569 , 0.227451  , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.46274513, 0.9960785 , 0.54509807, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.05490196, 0.7294118 , 0.9960785 , 0.9960785 , 0.227451  ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.2784314 ,
       0.9686275 , 0.9686275 , 0.54509807, 0.0627451 , 0.        ,
       0.        , 0.07450981, 0.227451  , 0.87843144, 0.9921569 ,
       0.9921569 , 0.8313726 , 0.03529412, 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.42352945, 0.9921569 ,
       0.9921569 , 0.92549026, 0.6862745 , 0.6862745 , 0.9686275 ,
       0.9921569 , 0.9960785 , 0.9921569 , 0.77647066, 0.16862746,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.2627451 , 0.8352942 , 0.8980393 , 0.9960785 ,
       0.9921569 , 0.9921569 , 0.9921569 , 0.9921569 , 0.83921576,
       0.48627454, 0.02352941, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.09019608, 0.60784316, 0.60784316, 0.8745099 ,
       0.7843138 , 0.46274513, 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ,
       0.        , 0.        , 0.        , 0.        ], dtype=float32)
InĀ [54]:
# well, that's not a picture (or image), it's an array.

mnist.train.images[5].shape
Out[54]:
(784,)

You might expect the training set to hold 28 $\times$ 28 grayscale images of handwritten digits, but that is not quite how they are stored.

Each image has been flattened: the 28 $\times$ 28 pixel grid is stored as a 1-D array of 784 values. Let's reshape one back.

InĀ [55]:
img = np.reshape(mnist.train.images[5], [28,28])
InĀ [56]:
img = mnist.train.images[5].reshape([28,28])
InĀ [57]:
# So now we have a 28x28 matrix, where each element is an intensity level from 0 to 1.  
img.shape
Out[57]:
(28, 28)
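The flatten/reshape round trip can be sketched in plain NumPy (a random toy array stands in for an MNIST image here):

```python
import numpy as np

# A toy "image": 28x28 grid of intensities in [0, 1]
img = np.random.rand(28, 28)

flat = img.reshape(-1)            # flatten to a 784-dimensional vector
restored = flat.reshape(28, 28)   # recover the original 2-D layout

print(flat.shape)                 # (784,)
print(np.array_equal(img, restored))  # True: flattening loses no values
```

Flattening only changes the layout in memory, not the pixel values, which is why the network can still learn from the 784-dimensional vectors.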

Let's visualize what some of these images and their corresponding training labels look like.

InĀ [58]:
plt.figure(figsize = (6,6))
plt.imshow(img, 'gray')
plt.xticks([])
plt.yticks([])
plt.show()
InĀ [59]:
mnist.train.labels[5]
Out[59]:
array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0.])
InĀ [60]:
np.argmax(mnist.train.labels[5])
Out[60]:
8

Embedded batch maker

InĀ [61]:
x, y = mnist.train.next_batch(3)

print(x.shape)
print(y.shape)
(3, 784)
(3, 10)

6.2. ANN with TensorFlow¶

  • Feed a grayscale image to the ANN


  • Our network model



- Network training (learning) $$\omega:= \omega - \alpha \nabla_{\omega} \ell \left( h_{\omega} \left(x^{(i)}\right),y^{(i)}\right)$$
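As a sketch of this update rule, here is plain gradient descent on a hypothetical one-dimensional loss $\ell(\omega) = (\omega - 3)^2$, whose minimum is at $\omega = 3$:

```python
# Gradient descent sketch on l(w) = (w - 3)^2 (a hypothetical toy loss)
w = 0.0
alpha = 0.1                  # learning rate

for _ in range(100):
    grad = 2 * (w - 3)       # dl/dw
    w = w - alpha * grad     # the update rule above

print(round(w, 4))           # 3.0
```

In the network below the same rule is applied, but the gradient is taken with respect to all weights and biases at once, which TensorFlow computes automatically.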

6.2.1. Import Library¶

InĀ [62]:
# Import Library
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

6.2.2. Load MNIST Data¶

  • Download MNIST data from tensorflow tutorial example
InĀ [63]:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
InĀ [64]:
train_x, train_y = mnist.train.next_batch(1)
img = train_x[0,:].reshape(28,28)

plt.figure(figsize=(6,6))
plt.imshow(img,'gray')
plt.title("Label : {}".format(np.argmax(train_y[0,:])))
plt.xticks([])
plt.yticks([])
plt.show()

One hot encoding

InĀ [65]:
print ('Train labels : {}'.format(train_y[0, :]))
Train labels : [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
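The one-hot encoding and its inverse can be sketched with NumPy:

```python
import numpy as np

# One-hot label <-> class index conversion (NumPy sketch)
label = 9
one_hot = np.eye(10)[label]       # row 9 of the 10x10 identity matrix
recovered = np.argmax(one_hot)    # back to the class index

print(one_hot)                    # [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
print(recovered)                  # 9
```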

6.2.3. Define an ANN Structure¶

  • Input size
  • Hidden layer size
  • The number of classes


InĀ [66]:
n_input = 28*28
n_hidden = 100
n_output = 10

6.2.4. Define Weights, Biases, and Placeholder¶

  • Define parameters based on predefined layer size
  • Initialize with normal distribution with $\mu = 0$ and $\sigma = 0.1$
InĀ [67]:
weights = {
    'hidden' : tf.Variable(tf.random_normal([n_input, n_hidden], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_hidden, n_output], stddev = 0.1))
}

biases = {
    'hidden' : tf.Variable(tf.random_normal([n_hidden], stddev = 0.1)),
    'output' : tf.Variable(tf.random_normal([n_output], stddev = 0.1))
}
InĀ [68]:
x = tf.placeholder(tf.float32, [None, n_input])
y = tf.placeholder(tf.float32, [None, n_output])

6.2.5. Build a Model¶

First, the layer performs several matrix multiplications to produce a set of linear activations



$$y_j = \left(\sum\limits_i \omega_{ij}x_i\right) + b_j \qquad \Longleftrightarrow \qquad y = \omega^T x + b$$


Second, each linear activation is run through a nonlinear activation function




Third, predict values with an affine transformation



InĀ [69]:
# Define Network
def build_model(x, weights, biases):
    
    # first hidden layer
    hidden = tf.add(tf.matmul(x, weights['hidden']), biases['hidden'])
    # non-linear activate function
    hidden = tf.nn.relu(hidden)
    
    # Output layer 
    output = tf.add(tf.matmul(hidden, weights['output']), biases['output'])
    
    return output
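For intuition, the same forward pass (affine → ReLU → affine) can be sketched in NumPy with randomly initialized weights, using the layer sizes defined above:

```python
import numpy as np

# NumPy sketch of the forward pass: affine -> ReLU -> affine
rng = np.random.default_rng(0)
n_input, n_hidden, n_output = 784, 100, 10

W_h = rng.normal(0, 0.1, (n_input, n_hidden))   # hidden weights
b_h = rng.normal(0, 0.1, n_hidden)
W_o = rng.normal(0, 0.1, (n_hidden, n_output))  # output weights
b_o = rng.normal(0, 0.1, n_output)

x_batch = rng.random((3, n_input))              # a batch of 3 flattened images
hidden = np.maximum(x_batch @ W_h + b_h, 0)     # ReLU activation
logits = hidden @ W_o + b_o                     # output layer (no activation)

print(logits.shape)                             # (3, 10)
```

Note that, as in `build_model`, the output layer produces raw logits; the softmax is applied later inside the loss function.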

6.2.6. Define Loss and Optimizer¶

Loss

  • This defines how we measure the model's accuracy during training. As covered in lecture, we want to minimize this function during training, which will "steer" the model in the right direction.
  • Classification: Cross entropy
    • Equivalent to applying logistic regression
$$ -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_{\theta}\left(x^{(i)}\right)\right) + (1-y^{(i)})\log\left(1-h_{\theta}\left(x^{(i)}\right)\right)\right] $$
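A NumPy sketch of softmax cross entropy on hypothetical logits and one-hot labels (the TensorFlow op used below fuses the softmax and the log in one step for numerical stability):

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross entropy between softmax(logits) and one-hot labels."""
    z = logits - logits.max(axis=1, keepdims=True)     # stabilize the exponentials
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -np.mean(np.sum(labels * np.log(p), axis=1))

logits = np.array([[2.0, 0.5, 0.1],
                   [0.2, 3.0, 0.3]])
labels = np.array([[1.0, 0.0, 0.0],    # class 0
                   [0.0, 1.0, 0.0]])   # class 1
print(softmax_cross_entropy(logits, labels))  # small: logits favor the true classes
```

The loss is small when the largest logit in each row matches the one-hot label, and grows when probability mass is put on the wrong class.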

Optimizer

  • This defines how the model is updated based on the data it sees and its loss function.
  • AdamOptimizer: one of the most widely used optimizers
InĀ [70]:
# Define Loss
pred = build_model(x, weights, biases)
loss = tf.nn.softmax_cross_entropy_with_logits(logits = pred, labels = y)
loss = tf.reduce_mean(loss)

LR = 0.0001
optm = tf.train.AdamOptimizer(LR).minimize(loss)

6.2.7. Define Optimization Configuration and Then Optimize¶




  • Define parameters for training ANN
    • n_batch: batch size for mini-batch gradient descent
    • n_iter: the number of iteration steps
    • n_prt: check loss for every n_prt iteration
  • Metrics
    • Here we can define metrics used to monitor the training and testing steps. In this example, we'll look at the accuracy, the fraction of the images that are correctly classified.

Initializer

  • Initialize all the variables
InĀ [71]:
n_batch = 50     # Batch Size
n_iter = 5000    # Learning Iteration
n_prt = 250      # Print Cycle
InĀ [72]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

loss_record_train = []
loss_record_test = []
for epoch in range(n_iter):
    train_x, train_y = mnist.train.next_batch(n_batch)
    sess.run(optm, feed_dict = {x: train_x, y: train_y}) 
    
    if epoch % n_prt == 0:        
        test_x, test_y = mnist.test.next_batch(n_batch)
        c1 = sess.run(loss, feed_dict = {x: train_x, y: train_y})
        c2 = sess.run(loss, feed_dict = {x: test_x, y: test_y})
        loss_record_train.append(c1)
        loss_record_test.append(c2)
        print ("Iter : {}".format(epoch))
        print ("Cost : {}".format(c1))
        
plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record_train))*n_prt, 
         loss_record_train, label = 'training')
plt.plot(np.arange(len(loss_record_test))*n_prt, 
         loss_record_test, label = 'testing')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.legend(fontsize = 12)
plt.ylim([0, np.max(loss_record_train)])
plt.show()
Iter : 0
Cost : 2.4312589168548584
Iter : 250
Cost : 1.0510585308074951
Iter : 500
Cost : 0.7897679805755615
Iter : 750
Cost : 0.622959315776825
Iter : 1000
Cost : 0.4856872856616974
Iter : 1250
Cost : 0.5245269536972046
Iter : 1500
Cost : 0.42817023396492004
Iter : 1750
Cost : 0.32486262917518616
Iter : 2000
Cost : 0.3275931477546692
Iter : 2250
Cost : 0.4030267298221588
Iter : 2500
Cost : 0.28863057494163513
Iter : 2750
Cost : 0.2863714396953583
Iter : 3000
Cost : 0.2694478929042816
Iter : 3250
Cost : 0.34466609358787537
Iter : 3500
Cost : 0.37973716855049133
Iter : 3750
Cost : 0.34435853362083435
Iter : 4000
Cost : 0.18402183055877686
Iter : 4250
Cost : 0.16266214847564697
Iter : 4500
Cost : 0.40091052651405334
Iter : 4750
Cost : 0.12686783075332642

6.2.8. Test or Evaluate¶

InĀ [73]:
test_x, test_y = mnist.test.next_batch(100)

my_pred = sess.run(pred, feed_dict = {x : test_x})
my_pred = np.argmax(my_pred, axis = 1)

labels = np.argmax(test_y, axis = 1)

accr = np.mean(np.equal(my_pred, labels))
print("Accuracy : {}%".format(accr*100))
Accuracy : 95.0%
InĀ [74]:
test_x, test_y = mnist.test.next_batch(1)
logits = sess.run(tf.nn.softmax(pred), feed_dict = {x : test_x})
predict = np.argmax(logits)

plt.figure(figsize = (6,6))
plt.imshow(test_x.reshape(28,28), 'gray')
plt.xticks([])
plt.yticks([])
plt.show()

print('Prediction : {}'.format(predict))
np.set_printoptions(precision = 2, suppress = True)
print('Probability : {}'.format(logits.ravel()))
Prediction : 4
Probability : [0.   0.   0.   0.   0.88 0.   0.   0.01 0.04 0.07]

You may observe that the accuracy on the test dataset is a little lower than the accuracy on the training dataset. This gap between training accuracy and test accuracy is an example of overfitting, in which a machine learning model performs worse on new data than on its training data.

What is the highest accuracy you can achieve with this first fully connected model? Since the handwritten digit classification task is pretty straightforward, you may be wondering how we can do better...

$\Rightarrow$ As we saw in lecture, convolutional neural networks (CNNs) are particularly well-suited for a variety of tasks in computer vision, and have achieved near-perfect accuracies on the MNIST dataset. We will build a CNN and ultimately output a probability distribution over the 10 digit classes (0-9) in the next lectures.

7. Autoencoder¶

7.1. Unsupervised Learning¶


Definition

  • Unsupervised learning refers to most attempts to extract information from a distribution that do not require human labor to annotate examples
  • Main task is to find the 'best' representation of the data

Dimension Reduction

  • Attempt to compress as much information as possible in a smaller representation
  • Preserve as much information as possible while obeying some constraint aimed at keeping the representation simpler

7.2. Autoencoders¶

It is like a 'deep learning version' of unsupervised learning.


Definition

  • An autoencoder is a neural network that is trained to attempt to copy its input to its output
  • The network consists of two parts: an encoder and a decoder that produce a reconstruction


Encoder and Decoder

  • Encoder function : $z = f(x)$
  • Decoder function : $x = g(z)$
  • We learn to set $g\left(f(x)\right) = x$





  • Autoencoder combines an encoder $f$ from the original space $\mathscr{X}$ to a latent space $\mathscr{F}$, and a decoder $g$ to map back to $\mathscr{X}$, such that $g \circ f$ is [close to] the identity on the data


$$ \mathbb{E} \left[ \lVert X - g \circ f(X) \rVert^2 \right] \approx 0$$
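As a sketch of this objective, a purely linear encoder/decoder built from the top singular vectors (i.e. PCA) already makes $g \circ f$ nearly the identity when the data lies on a low-dimensional subspace; the data below is hypothetical:

```python
import numpy as np

# Linear autoencoder sketch via SVD: data on a 2-D subspace of R^10
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 10))

U, S, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:2].T                        # top-2 principal directions

f = lambda x: x @ V                 # encoder: project onto the 2-D latent space
g = lambda z: z @ V.T               # decoder: map back to R^10

err = np.mean(np.sum((X - g(f(X)))**2, axis=1))
print(err)                          # ~0: g(f(x)) reconstructs x on this data
```

The nonlinear autoencoder below plays the same game, but with tanh layers it can capture curved (non-linear) manifolds that PCA cannot.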



  • A proper autoencoder has to capture a "good" parametrization of the signal, and in particular the statistical dependencies between the signal components.

7.3. Autoencoder with TensorFlow¶

  • MNIST example
  • Use only (1, 5, 6) digits to visualize in 2-D



7.3.1. Import Library¶

InĀ [75]:
# import os
# os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
# os.environ["CUDA_VISIBLE_DEVICES"]="0"
InĀ [76]:
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
%matplotlib inline

7.3.2. Load MNIST Data¶

InĀ [77]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
  • Only use (1, 5, 6) digits to visualize latent space in 2-D
InĀ [78]:
train_idx = ((np.argmax(mnist.train.labels, 1) == 1) | \
             (np.argmax(mnist.train.labels, 1) == 5) | \
             (np.argmax(mnist.train.labels, 1) == 6))
test_idx = ((np.argmax(mnist.test.labels, 1) == 1) | \
            (np.argmax(mnist.test.labels, 1) == 5) | \
            (np.argmax(mnist.test.labels, 1) == 6))

train_imgs   = mnist.train.images[train_idx]
train_labels = mnist.train.labels[train_idx]
test_imgs    = mnist.test.images[test_idx]
test_labels  = mnist.test.labels[test_idx]
n_train      = train_imgs.shape[0]
n_test       = test_imgs.shape[0]

print ("The number of training images : {}, shape : {}".format(n_train, train_imgs.shape))
print ("The number of testing images : {}, shape : {}".format(n_test, test_imgs.shape))
The number of training images : 16583, shape : (16583, 784)
The number of testing images : 2985, shape : (2985, 784)

7.3.3. Define a Structure of an Autoencoder¶

  • Input shape and latent variable shape
  • Encoder shape
  • Decoder shape


InĀ [79]:
# Shape of input and latent variable

n_input = 28*28

# Encoder structure
n_encoder1 = 500
n_encoder2 = 300

n_latent = 2

# Decoder structure
n_decoder2 = 300
n_decoder1 = 500

7.3.4. Define Weights, Biases, and Placeholder¶

  • Define weights and biases for encoder and decoder, separately
  • Based on the pre-defined layer size
  • Initialize with normal distribution of $\mu=0$ and $\sigma=0.1$
InĀ [80]:
weights = {
    'encoder1' : tf.Variable(tf.random_normal([n_input, n_encoder1], stddev = 0.1)),
    'encoder2' : tf.Variable(tf.random_normal([n_encoder1, n_encoder2], stddev = 0.1)),
    'latent' : tf.Variable(tf.random_normal([n_encoder2, n_latent], stddev = 0.1)),
    'decoder2' : tf.Variable(tf.random_normal([n_latent, n_decoder2], stddev = 0.1)),
    'decoder1' : tf.Variable(tf.random_normal([n_decoder2, n_decoder1], stddev = 0.1)),
    'reconst' : tf.Variable(tf.random_normal([n_decoder1, n_input], stddev = 0.1))
}

biases = {
    'encoder1' : tf.Variable(tf.random_normal([n_encoder1], stddev = 0.1)),
    'encoder2' : tf.Variable(tf.random_normal([n_encoder2], stddev = 0.1)),
    'latent' : tf.Variable(tf.random_normal([n_latent], stddev = 0.1)),
    'decoder2' : tf.Variable(tf.random_normal([n_decoder2], stddev = 0.1)),
    'decoder1' : tf.Variable(tf.random_normal([n_decoder1], stddev = 0.1)),
    'reconst' : tf.Variable(tf.random_normal([n_input], stddev = 0.1))
}
InĀ [81]:
x = tf.placeholder(tf.float32, [None, n_input])

7.3.5. Build a Model¶

Encoder

  • Simple ANN (MLP) model
  • Use tanh as the nonlinear activation function
  • No nonlinear activation is applied to latent

Decoder

  • Simple ANN (MLP) model
  • Use tanh as the nonlinear activation function
  • No nonlinear activation is applied to reconst


InĀ [82]:
def encoder(x, weights, biases):
    encoder1 = tf.add(tf.matmul(x, weights['encoder1']), biases['encoder1'])
    encoder1 = tf.nn.tanh(encoder1)
    
    encoder2 = tf.add(tf.matmul(encoder1, weights['encoder2']), biases['encoder2'])
    encoder2 = tf.nn.tanh(encoder2)
    
    latent = tf.add(tf.matmul(encoder2, weights['latent']), biases['latent'])

    return latent
InĀ [83]:
def decoder(latent, weights, biases):
    decoder2 = tf.add(tf.matmul(latent, weights['decoder2']), biases['decoder2'])
    decoder2 = tf.nn.tanh(decoder2)
    
    decoder1 = tf.add(tf.matmul(decoder2, weights['decoder1']), biases['decoder1'])
    decoder1 = tf.nn.tanh(decoder1)
    
    reconst = tf.add(tf.matmul(decoder1, weights['reconst']), biases['reconst'])
   
    return reconst

7.3.6. Define Loss and Optimizer¶

Loss

  • Squared loss
$$ \frac{1}{m}\sum_{i=1}^{m} (t_{i} - y_{i})^2 $$
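A NumPy sketch of this squared loss on hypothetical inputs $t$ and reconstructions $y$; this is what `tf.reduce_mean(tf.square(x - reconst))` computes below, averaging over both pixels and the batch:

```python
import numpy as np

# Mean squared reconstruction error (NumPy sketch, toy values)
x         = np.array([[0.0, 1.0], [0.5, 0.5]])   # inputs t
x_reconst = np.array([[0.1, 0.9], [0.5, 0.3]])   # reconstructions y

mse = np.mean((x - x_reconst)**2)
print(mse)   # ~0.015
```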

Optimizer

  • AdamOptimizer: the most popular optimizer
InĀ [84]:
LR = 0.0001

latent = encoder(x, weights, biases)
reconst = decoder(latent, weights, biases)
loss = tf.square(tf.subtract(x, reconst))
loss = tf.reduce_mean(loss)

optm = tf.train.AdamOptimizer(LR).minimize(loss)

7.3.7. Define Optimization Configuration and Then Optimize¶



  • Define parameters for training autoencoder
    • n_batch : batch size for mini-batch gradient descent
    • n_iter : the number of iteration steps
    • n_prt : check loss for every n_prt iteration
InĀ [85]:
n_batch = 50
n_iter = 2500
n_prt = 250
InĀ [86]:
def train_batch_maker(batch_size):
    random_idx = np.random.randint(n_train, size = batch_size)
    return train_imgs[random_idx], train_labels[random_idx]
InĀ [87]:
def test_batch_maker(batch_size):
    random_idx = np.random.randint(n_test, size = batch_size)
    return test_imgs[random_idx], test_labels[random_idx]
InĀ [88]:
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

loss_record_train = []
loss_record_test = []
for epoch in range(n_iter):
    train_x, _ = train_batch_maker(n_batch)
    sess.run(optm, feed_dict = {x : train_x})  
    
    if epoch % n_prt == 0:
        test_x, _ = test_batch_maker(n_batch)
        c1 = sess.run(loss, feed_dict = {x: train_x})
        c2 = sess.run(loss, feed_dict = {x: test_x})
        loss_record_train.append(c1)
        loss_record_test.append(c2)
        print ("Iter : {}".format(epoch))
        print ("Cost : {}".format(c1))
        
plt.figure(figsize=(10,8))
plt.plot(np.arange(len(loss_record_train))*n_prt, loss_record_train, label = 'training')
plt.plot(np.arange(len(loss_record_test))*n_prt, loss_record_test, label = 'testing')
plt.xlabel('iteration', fontsize = 15)
plt.ylabel('loss', fontsize = 15)
plt.legend(fontsize = 12)
plt.ylim([0,np.max(loss_record_train)])
plt.show()
Iter : 0
Cost : 0.4592665433883667
Iter : 250
Cost : 0.04549552500247955
Iter : 500
Cost : 0.042882248759269714
Iter : 750
Cost : 0.04552707076072693
Iter : 1000
Cost : 0.04222516342997551
Iter : 1250
Cost : 0.03970355540513992
Iter : 1500
Cost : 0.03572160378098488
Iter : 1750
Cost : 0.03479230776429176
Iter : 2000
Cost : 0.03626435995101929
Iter : 2250
Cost : 0.034916508942842484

7.3.8. Test or Evaluate¶

  • Test reconstruction performance of the autoencoder
InĀ [89]:
test_x, _ = test_batch_maker(1)
x_reconst = sess.run(reconst, feed_dict = {x: test_x})

plt.figure(figsize = (10,8))
plt.subplot(1,2,1)
plt.imshow(test_x.reshape(28,28), 'gray')
plt.title('Input Image', fontsize = 15)
plt.xticks([])
plt.yticks([])
plt.subplot(1,2,2)
plt.imshow(x_reconst.reshape(28,28), 'gray')
plt.title('Reconstructed Image', fontsize = 15)
plt.xticks([])
plt.yticks([])
plt.show()
  • To see the distribution of latent variables, we project the 784-dimensional image space onto the 2-dimensional latent space
InĀ [90]:
test_x, test_y = test_batch_maker(500)
test_y = np.argmax(test_y, axis = 1)
test_latent = sess.run(latent, feed_dict = {x: test_x})

plt.figure(figsize = (10,10))
plt.scatter(test_latent[test_y == 1,0], test_latent[test_y == 1,1], label = '1')
plt.scatter(test_latent[test_y == 5,0], test_latent[test_y == 5,1], label = '5')
plt.scatter(test_latent[test_y == 6,0], test_latent[test_y == 6,1], label = '6')
plt.title('Latent Space', fontsize=15)
plt.xlabel('Z1', fontsize=15)
plt.ylabel('Z2', fontsize=15)
plt.legend(fontsize = 15)
plt.axis('equal')
plt.show()

Data Generation

  • It generates something that makes sense.

  • These results are unsatisfying, because the density model used on the latent space ℱ is too simple and inadequate.

  • Building a ā€œgoodā€ model amounts to our original problem of modeling an empirical distribution, although it may now be in a lower dimension space.

  • This is a motivation to VAE or GAN.

InĀ [91]:
new_data = np.array([[-4, 0]])

latent_input = tf.placeholder(tf.float32, [None, n_latent])
reconst = decoder(latent_input, weights, biases)
fake_image = sess.run(reconst, feed_dict = {latent_input: new_data})

plt.figure(figsize=(16,7))
plt.subplot(1,2,1)
plt.scatter(test_latent[test_y == 1,0], test_latent[test_y == 1,1], label = '1')
plt.scatter(test_latent[test_y == 5,0], test_latent[test_y == 5,1], label = '5')
plt.scatter(test_latent[test_y == 6,0], test_latent[test_y == 6,1], label = '6')
plt.scatter(new_data[:,0], new_data[:,1], c = 'k', marker = 'o', s = 200, label = 'new data')
plt.title('Latent Space', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.legend(loc = 2, fontsize = 12)
plt.axis('equal')
plt.subplot(1,2,2)
plt.imshow(fake_image.reshape(28,28), 'gray')
plt.title('Generated Fake Image', fontsize = 15)
plt.xticks([])
plt.yticks([])
plt.show()

7.4. Visualization¶

Image Generation

  • Select an arbitrary latent variable $z$
  • Generate images using the learned decoder
InĀ [92]:
# Initialize canvas
nx = 20
ny = 20
x_values = np.linspace(-8, 4, nx)
y_values = np.linspace(-4, 6, ny)
canvas = np.empty((28*ny, 28*nx))

# Define placeholder
latent_input = tf.placeholder(tf.float32, [None, n_latent])
reconst = decoder(latent_input, weights, biases)

for i, yi in enumerate(y_values):
        for j, xi in enumerate(x_values):
            latent_ = np.array([[xi, yi]])
            reconst_ = sess.run(reconst, feed_dict = {latent_input: latent_})
            canvas[(nx-i-1)*28:(nx-i)*28,j*28:(j+1)*28] = reconst_.reshape(28, 28)

plt.figure(figsize = (16, 7))
plt.subplot(1,2,1)
plt.scatter(test_latent[test_y == 1,0], test_latent[test_y == 1,1], label = '1')
plt.scatter(test_latent[test_y == 5,0], test_latent[test_y == 5,1], label = '5')
plt.scatter(test_latent[test_y == 6,0], test_latent[test_y == 6,1], label = '6')
plt.title('Latent Space', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.legend(fontsize = 12)
plt.axis('equal')
plt.subplot(1,2,2)
plt.imshow(canvas, 'gray')
plt.title('Manifold', fontsize = 15)
plt.xlabel('Z1', fontsize = 15)
plt.ylabel('Z2', fontsize = 15)
plt.xticks([])
plt.yticks([])
plt.show()

7.5. Latent Representation¶

To get an intuition of the latent representation, we can pick two samples $x$ and $x'$ at random and interpolate samples along the line in the latent space

$$g((1-\alpha)f(x) + \alpha f(x'))$$
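A NumPy sketch of the interpolation: given two hypothetical latent codes $z = f(x)$ and $z' = f(x')$, we form the convex combinations that would be fed to the decoder $g$:

```python
import numpy as np

# Interpolate between two (hypothetical) latent codes z = f(x), z' = f(x')
z  = np.array([-4.0, 0.0])
zp = np.array([ 2.0, 3.0])

alphas = np.linspace(0, 1, 5)                   # alpha = 0 gives z, alpha = 1 gives z'
path = [(1 - a) * z + a * zp for a in alphas]   # points to decode with g

print(np.array(path))
```

Decoding each point on this path produces the gradual morphing between digits shown in the figures above.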



  • Interpolation in High Dimension



  • Interpolation in Manifold



InĀ [93]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')